11 research outputs found

    Pembangunan repositori warisan budaya berasaskan ontologi

    Malaysia is a country rich in traditional cultural heritage, as can be seen in the varied practices and ways of life of its people. However, Malaysia's traditional cultural heritage is steadily disappearing under the tide of modernisation. Information on cultural heritage can be obtained from many sources and locations, but inefficient information management leaves that information inconsistent and makes it difficult to search. A cultural heritage repository was developed to address this problem. The repository is built on an ontology to enable semantic search and knowledge representation. The ontology is based on tangible objects registered as national heritage objects with the Jabatan Warisan Negara (JWN). Additional information was also gathered from various sources, including the Perpustakaan Negara, the Muzium Negara, history books, journals, articles and online references. The categories for each cultural heritage object were classified using a middle-out approach. A prototype cultural heritage information retrieval system was developed to test the effectiveness of information management and retrieval in the repository. Testing of the prototype found that using an ontology to manage the information in the repository improves the quality of searches for intangible heritage information. The repository could be extended into a main hub for managing and accessing national cultural heritage information.

    Arabic Rule-Based Named Entity Recognition Systems Progress and Challenges

    Rule-based approaches use hand-crafted rules to extract Named Entities (NEs) and, alongside Machine Learning, are among the best-known ways of doing so. Named Entity Recognition (NER) is the task of identifying personal names, locations, organizations and many other entity types. For the Arabic language, Big Data challenges are driving the rapid development of Arabic NER to extract useful information from texts. This paper sheds some light on research progress in rule-based systems via a diagnostic comparison of linguistic resources, entity types, domains, and performance. We also highlight the challenges of processing Arabic NEs through rule-based systems. Good NER performance is expected to benefit other modern fields such as semantic web searching, question answering, machine translation, information retrieval, and abstracting systems.

    The effectiveness of bottom up technique with probabilistic approach for a Malay parser

    Parsing is the process of analyzing the input string of a sentence to determine its syntactic structure according to the rules of a grammar. This task is performed by a parser, which produces a parse tree as output. However, a problem occurs when the parsing process produces two or more parse trees, so that the parser is unable to present a single precise parse tree. This limitation is caused by ambiguity in the structure of sentences. Ambiguity occurs when a word can be classified into more than one syntactic category and its usage affects the semantics of the sentence. Thus, the parser needs an approach that resolves the ambiguity and selects the most appropriate parse tree to represent a sentence. Like other languages, Malay, the national language of Malaysia, is not exempt from the ambiguity problem. However, because its grammar is context-free, the probabilistic context-free grammar approach can be used to help the parser determine a more accurate parse tree. This study focuses on the development of a statistical parser for the Malay language using a bottom-up technique. The training data, in the form of simple Malay sentences, were collected from various sources. Based on this training data, a statistical lexical corpus of Malay consisting of vocabulary, grammar rules and their probabilities was developed. The bottom-up parsing is implemented with the Cocke–Younger–Kasami (CYK) algorithm. The parser's performance is evaluated on how effectively it overcomes ambiguity by suggesting a more precise parse tree. In conclusion, the Malay Language Parser can help users identify the appropriate parse tree and resolve ambiguity issues in the Malay language.
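
    To make the bottom-up probabilistic idea concrete, the following is a minimal sketch of CYK parsing over a probabilistic context-free grammar in Chomsky normal form. The grammar, the probabilities, and the names `BINARY_RULES`, `LEXICAL_RULES` and `cyk_best_parse` are invented for illustration and are not taken from the study; the point is only how the chart keeps the most probable analysis for each span.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (NOT the study's grammar).
# (left child, right child) -> list of (parent symbol, rule probability)
BINARY_RULES = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}
# word -> list of (tag, probability); values are illustrative assumptions
LEXICAL_RULES = {
    "Ali": [("NP", 0.5)],
    "nasi": [("NP", 0.5)],
    "makan": [("V", 1.0)],
}


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) for the most probable parse, or None."""
    n = len(words)
    # best[(i, j)][symbol] = (log probability, backpointer) for words[i:j]
    best = defaultdict(dict)

    # Fill the diagonal of the chart from the lexicon.
    for i, word in enumerate(words):
        for tag, prob in LEXICAL_RULES.get(word, []):
            best[(i, i + 1)][tag] = (math.log(prob), word)

    # Combine smaller spans into larger ones, bottom-up.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, (left_lp, _) in best[(i, k)].items():
                    for right, (right_lp, _) in best[(k, j)].items():
                        for parent, prob in BINARY_RULES.get((left, right), []):
                            lp = math.log(prob) + left_lp + right_lp
                            cell = best[(i, j)]
                            # Keep only the highest-scoring analysis per symbol.
                            if parent not in cell or lp > cell[parent][0]:
                                cell[parent] = (lp, (k, left, right))

    if start not in best[(0, n)]:
        return None

    def build(i, j, symbol):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = best[(i, j)][symbol]
        if isinstance(back, str):
            return (symbol, back)
        k, left, right = back
        return (symbol, build(i, k, left), build(k, j, right))

    return math.exp(best[(0, n)][start][0]), build(0, n, start)
```

    Calling `cyk_best_parse(["Ali", "makan", "nasi"])` yields a single tree rooted at `S`; when a grammar licenses several trees for the same span, the comparison on log probabilities is what selects one of them.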

    The effectiveness of url features on phishing emails classification using machine learning approach

    Phishing email classification requires well-chosen features to achieve good accuracy. One reason for the lack of development of models for detecting phishing emails is the complexity of feature selection. Feature selection is an essential part of getting a good classification result; commonly used features are the header, the body, and the Uniform Resource Locator (URL). Besides the email body text, the URL is one of the leading indicators that a phishing attack has succeeded. The URL is commonly placed in the body of the phishing email to get the victim's attention. It redirects the victim to a fake website that collects the victim's personal information. There is a lack of information about how URL features affect phishing email classification results. Therefore, this work focuses on using URL features to determine whether an email is phishing or legitimate using machine learning approaches. Two public datasets are used in this work: the Online Phishing Corpus and the Enron Corpus. The URL features are extracted using the Beautiful Soup library. Two machine learning classifiers are used: Support Vector Machine (SVM) and Artificial Neural Network (ANN). The experiments were divided into two according to the features given to the classifiers. The first experiment used raw email data with URL features, while the second used raw email data only. The first experiment shows higher accuracy for both classifiers, SVM and ANN. Hence, this research shows that including URL features increases classification performance.
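
    As an illustration of what extracting URL features from an email body can look like, here is a minimal sketch. The abstract names the Beautiful Soup library; to keep the example self-contained it uses the standard library's `html.parser` instead, and the specific features (`num_urls`, `num_ip_hosts`, `num_non_https`, `max_dots_in_host`) are assumptions chosen for illustration, not the feature set of the study.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Host that looks like a dotted IPv4 address, e.g. 192.168.0.1
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(html_body):
    """Return a dict of simple, hypothetical URL features for one email body."""
    parser = LinkCollector()
    parser.feed(html_body)
    hosts = [urlparse(href).hostname or "" for href in parser.hrefs]
    return {
        "num_urls": len(parser.hrefs),
        "num_ip_hosts": sum(bool(IP_HOST.match(host)) for host in hosts),
        "num_non_https": sum(urlparse(h).scheme != "https" for h in parser.hrefs),
        "max_dots_in_host": max((host.count(".") for host in hosts), default=0),
    }
```

    A dictionary like this could be appended to the other per-email features before training a classifier; which features actually help is the empirical question the study addresses.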

    Analysing the content of Web 2.0 documents by using a hybrid approach

    User involvement in Web 2.0 has made a significant contribution to the increase in the amount of multimedia content on the Web. Images are one of the most used media, shared across the network to mark user experience in daily life. Interactive applications have allowed users to participate in describing these images, usually in the form of free text, thus gradually enriching the images' descriptions. Nevertheless, often these images are left with crude or no description. Web search engines such as Google and Yahoo provide text based searching to find images by mapping query concepts with the text description of the image, thus limiting the information discovery to material with good text descriptions. A similar issue is faced by text based search provided by Web 2.0 applications. Images with less description might not contain adequate information while images with no description will be useless as they will become unsearchable by a text based search. Therefore, there is an urgent need to investigate ways to produce high quality information to provide insight into the document content. The aim of this research is to investigate a means to improve the capability of information retrieval by utilizing Web 2.0 content, the Semantic Web and other emerging technologies. A hybrid approach is proposed which analyses two main aspects of Web 2.0 content, namely text and images. The text analysis consists of using Natural Language Processing and ontologies. The aim of the text analysis is to translate free text descriptions into a semantic information model tailored to Semantic Web standards. Image analysis is developed using machine learning tools and is assessed using ROC analysis. The aim of the image analysis is to develop an image classifier exemplar to identify information in images based on their visual features. The hybrid approach is evaluated based on standard information retrieval performance metrics, precision and recall. 
    The example semantic information model has structured and enriched the textual content, thus providing better retrieval results compared to conventional tag based search. The image classifier is shown to be useful for providing additional information about image content. Each of the approaches has its own strengths and they complement each other in different scenarios. The thesis demonstrates that the hybrid approach has improved information retrieval performance compared to either of the contributing techniques used separately.
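
    The precision and recall metrics mentioned in this abstract can be computed as in the sketch below. The function name `precision_recall` and the document identifiers in the usage note are hypothetical; only the standard definitions of the two metrics are assumed.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

    For example, if a search returns documents `[1, 2, 3, 4]` and the relevant ones are `[2, 4, 5]`, two of the four results are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).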

    Implementation of Kadazan Tagger Based on Brill's Method

    We present and evaluate an implementation of Part of Speech (POS) tagging for the Kadazan language using the transformation-based approach. The main purpose of this study is to develop an automatic POS tagger for the Kadazan language, which had never been developed before. POS tagging can tag the Kadazan corpus automatically and can help reduce the ambiguity problem of this language. The approach is implemented with the aim of achieving accuracy higher than, or at least similar to, that of other tagging approaches such as the statistical and the original rule-based approaches. The approach transforms tags according to a prescribed set of rules. A number of objectives were set in order to achieve the main purpose of this study: first, to apply lexical and contextual rules for this language; second, to implement Brill's algorithm based on the set of rules; and finally, to determine the effectiveness of Kadazan POS tagging using this approach. The tagging system was trained using four Kadazan corpora containing 5663 words in all. Based on the evaluation results, the tagging system achieved around 93% accuracy.
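
    The rule-application stage of a transformation-based tagger can be sketched as follows. This is a simplified illustration, not the study's implementation: the lexicon and the single contextual rule are invented, English words stand in for Kadazan data, and the rule-learning stage of Brill's method is omitted.

```python
# Hypothetical lexicon: each known word gets its most frequent tag.
LEXICON = {"the": "DET", "can": "MD", "rusts": "VB"}

# Hypothetical contextual rules: (from_tag, to_tag, previous_tag) means
# "change from_tag to to_tag when the preceding word is tagged previous_tag".
RULES = [("MD", "NN", "DET")]


def brill_tag(words, default_tag="NN"):
    """Tag words with the lexicon, then apply contextual rules in order."""
    # Step 1: initial tagging; unknown words fall back to default_tag.
    tags = [LEXICON.get(word, default_tag) for word in words]
    # Step 2: apply each transformation rule left to right over the sentence.
    for from_tag, to_tag, previous_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return list(zip(words, tags))
```

    With these toy rules, `brill_tag(["the", "can", "rusts"])` first tags "can" as `MD` and then corrects it to `NN` because it follows a determiner.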

    THE DEVELOPMENT OF CULTURAL HERITAGE REPOSITORY BASED ON ONTOLOGY

    Malaysia is known for its traditional cultural heritage. This is reflected in the diverse range of practices and ways of life of its people. Nevertheless, the traditional cultural heritage has gradually been forgotten over time. Information related to traditional cultural heritage can be obtained from various sources and locations but is inconsistent, which makes information seeking difficult. A repository of cultural heritage was constructed to overcome this problem. The repository was built on an ontology, which enables semantic knowledge representation and searching. The ontology consists of tangible objects that are listed as national heritage objects by the National Heritage Department (JWN). Additional information was also collected from various sources, including the National Library, the National Museum, history books, articles and online references. Classes for each object were identified using a middle-out approach. An information retrieval system prototype was constructed to test the effectiveness of information management and access. The evaluation of the prototype shows that the ontology has improved the information management and information searching of tangible heritage. This cultural heritage repository could be extended into a main hub for national cultural heritage management and access.

    ‘The mother made conscious’ : the historical development of a primary school pedagogy

    With the increasing amount of multimedia content on the web added as user generated content in Web 2.0 websites, conventional multimedia information retrieval is presented with new challenges. It is no longer possible to rely only on meta-data based retrieval; content based techniques must also be considered, combined with the collective knowledge generated by users' contributions and geo-referenced meta-data. Tagging is a modest way to annotate such documents and fails to capture a full semantic description of the document content. This report concerns ongoing research to investigate a means to identify, model and utilise semantic descriptions of the user-generated content in Web 2.0 documents using a hybrid approach. The approach consists of three main components: natural language processing, image analysis and a shared knowledge base. In this paper we describe the complete model but, as the image analysis component is in its early stages, the results focus on the natural language processing and the knowledge base. We show that the additional use of these components can improve retrieval and analysis performance over that based only on Web 2.0 tags.